You suspect a casino coin is unfair. Let \(p\) be the probability of the coin landing on heads (1).
Problem: After five trials, you observe the sequence \([1, 1, 0, 0, 0]\). Please derive the Maximum Likelihood Estimate (MLE) for the probability of getting heads.
In a sequence of \(n\) independent Bernoulli trials, the likelihood function for \(p\) is: \[ L(p) = p^k (1-p)^{n-k} \]
where \(n=5\) and \(k=2\) (the number of heads).
To find the MLE, we maximize the log-likelihood \(l(p)\): \[ l(p) = 2 \ln(p) + 3 \ln(1-p) \]
Taking the first derivative with respect to \(p\) and setting it to zero: \[ \begin{aligned} \frac{dl}{dp} =& \frac{2}{p} - \frac{3}{1-p} = 0 \\ 2(1-p) =& 3p \implies 2 - 2p = 3p \implies 5p = 2 \\ \hat{p}_{MLE} =& \frac{2}{5} = 0.4 \end{aligned} \]
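As a sanity check, the closed form \(\hat{p} = k/n\) can be verified numerically; below is a minimal sketch in plain Python (the grid search is just an illustrative stand-in for an optimizer):

```python
import math

# Observed data from the problem: 2 heads (k) in 5 tosses (n).
n, k = 5, 2

def log_likelihood(p):
    # l(p) = k*ln(p) + (n - k)*ln(1 - p)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid search over (0, 1); the maximizer should coincide with k/n = 0.4.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat)  # 0.4
```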
This time we incorporate a prior: the probability of heads follows a Beta distribution, \(p \sim \text{Beta}(2,8)\). What is the maximum a posteriori (MAP) estimate of the probability of getting heads?
The Beta distribution is the conjugate prior for the Binomial likelihood. If the prior is \(\text{Beta}(\alpha, \beta)\) and we observe \(k\) successes in \(n\) trials, the posterior is: \[ p | \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k) \]
Plugging in the values (\(\alpha=2, \beta=8, k=2, n=5\)): \[ p | \text{data} \sim \text{Beta}(2 + 2, 8 + 3) = \text{Beta}(4, 11) \]
The MAP estimate is the mode of the posterior distribution:
\[ \hat{p}_{MAP} = \frac{4 - 1}{4 + 11 - 2} = \frac{3}{13} \approx 0.231 \]
Conclusion: The MAP estimate is \(\approx 0.231\), which shifts the MLE (\(0.4\)) toward the prior mean (\(0.2\)).
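The conjugate update and posterior mode are easy to compute directly; a minimal sketch (the mode formula is valid when both posterior parameters exceed 1):

```python
# Prior Beta(2, 8); data: 2 heads in 5 tosses.
alpha, beta = 2, 8
k, n = 2, 5

# Conjugate update: posterior is Beta(alpha + k, beta + n - k) = Beta(4, 11).
a_post, b_post = alpha + k, beta + (n - k)

# MAP = posterior mode (a - 1) / (a + b - 2), assuming a, b > 1.
p_map = (a_post - 1) / (a_post + b_post - 2)
print(p_map)  # 3/13 ≈ 0.231
```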
How do weak and stronger priors (say, \(p \sim \text{Beta}(20,80)\); note that the prior belief is still a \(0.2\) chance of heads) affect the MAP estimate?
Both \(\text{Beta}(2, 8)\) and \(\text{Beta}(20, 80)\) have the same mean (\(0.2\)), but the latter has a much smaller variance, representing a “stronger” or more certain belief.
A stronger prior is more resistant to change from new data. With the \(\text{Beta}(20, 80)\) prior, the posterior is \(\text{Beta}(22, 83)\), whose mode is \(21/103 \approx 0.204\). Even though we observed \(40\%\) heads in our sample, the strong prior dominates the calculation, yielding an estimate much closer to the prior mean (\(0.20\)) than the weak-prior estimate (\(0.231\)).
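The comparison can be sketched with a small helper (`beta_map` is a hypothetical function name, not from any library):

```python
def beta_map(alpha, beta, k, n):
    # Posterior mode for a Beta(alpha, beta) prior after k heads in n tosses,
    # assuming both posterior parameters exceed 1.
    a, b = alpha + k, beta + (n - k)
    return (a - 1) / (a + b - 2)

weak = beta_map(2, 8, 2, 5)      # posterior Beta(4, 11)  -> 3/13   ≈ 0.231
strong = beta_map(20, 80, 2, 5)  # posterior Beta(22, 83) -> 21/103 ≈ 0.204
print(weak, strong)
```

The stronger prior's estimate sits much closer to the prior mean of 0.2.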
For a linear regression model (for simplicity we do not consider the intercept in this case), \(y = x\beta_1 + \epsilon\), where \(\epsilon \sim N(0,\sigma^2)\). Implicitly, \(y \sim N(x\beta_1, \sigma^2)\).
Please show that the log-likelihood for \(N\) observations \((y_i, x_i)\) given \(\beta_1\) is
\[ l(\beta_1,\sigma^2,y,x) = -\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-x_i\beta_1)^2 \]
Since \(\epsilon_i \sim N(0, \sigma^2)\), the response variable follows the distribution \(y_i \sim N(x_i\beta_1, \sigma^2)\). The probability density function (PDF) for a single observation is: \[ f(y_i | x_i, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i\beta_1)^2}{2\sigma^2} \right) \]
Assuming the observations are independent and identically distributed (i.i.d.), the likelihood function \(L(\beta_1, \sigma^2)\) is the product of the individual densities: \[ \begin{aligned} L(\beta_1, \sigma^2) =& \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i\beta_1)^2}{2\sigma^2} \right)\\ L(\beta_1, \sigma^2) =& (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i\beta_1)^2 \right) \end{aligned} \]
Taking the natural logarithm to find the log-likelihood \(l = \ln(L)\):
\[ \begin{aligned} l(\beta_1, \sigma^2) =& \ln \left[ (2\pi\sigma^2)^{-N/2} \right] + \ln \left[ \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i\beta_1)^2 \right) \right] \\ l(\beta_1, \sigma^2) =& -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - x_i\beta_1)^2 \end{aligned} \]
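The algebra can be verified numerically: on hypothetical toy data, the closed-form log-likelihood above must equal the sum of per-observation Gaussian log-densities.

```python
import math

def normal_logpdf(y, mu, var):
    # Log of the N(mu, var) density evaluated at y.
    return -0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)

# Hypothetical toy data for the no-intercept model y = x*beta1 + eps.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.3, 2.8, 4.2]
beta1, var = 1.0, 0.5
N = len(y)

# Closed form: -(N/2) ln(2*pi*sigma^2) - RSS / (2*sigma^2).
rss = sum((yi - xi * beta1) ** 2 for xi, yi in zip(x, y))
l_closed = -N / 2 * math.log(2 * math.pi * var) - rss / (2 * var)

# Direct sum of log-densities; should agree up to floating-point error.
l_direct = sum(normal_logpdf(yi, xi * beta1, var) for xi, yi in zip(x, y))
print(l_closed, l_direct)
```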
Following from question 2.1, please show that the MLE estimate for \(\beta_1\) is
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2} \]
To find the MLE \(\hat{\beta}_1\), we maximize the log-likelihood by taking the partial derivative with respect to \(\beta_1\) and setting it to zero:
Applying the chain rule:
\[ \begin{aligned} \frac{\partial l}{\partial \beta_1} =& -\frac{1}{2\sigma^2} \sum_{i=1}^N 2(y_i - x_i\beta_1)(-x_i) \\ =& \frac{1}{\sigma^2} \sum_{i=1}^N (x_i y_i - x_i^2 \beta_1) \end{aligned} \]
Set to Zero: \[ \frac{1}{\sigma^2} \left( \sum_{i=1}^N x_i y_i - \hat{\beta}_1 \sum_{i=1}^N x_i^2 \right) = 0 \]
Solve for \(\hat{\beta}_1\):
\[ \begin{aligned} \sum_{i=1}^N x_i y_i =& \hat{\beta}_1 \sum_{i=1}^N x_i^2 \\ \hat{\beta}_1 =& \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^Nx_i^2} \end{aligned} \]
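On hypothetical toy data, the closed form is one line of Python:

```python
# MLE slope for the no-intercept model, hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# beta1_hat = sum(x_i * y_i) / sum(x_i^2)
beta1_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print(beta1_hat)  # 59.7 / 30 ≈ 1.99
```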
The null model for a linear model is an intercept-only model, that is, \(y \sim N(\beta_0, \sigma^2)\). Please show that the log-likelihood for \(N\) observations under the null model is
\[ l(\beta_0,\sigma^2,y) = -\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\beta_0)^2 \]
Replacing \(x_i\beta_1\) with \(\beta_0\) in the derivation from 2.1 yields the result directly.
Please show that the Likelihood Ratio Test (LRT) statistic for the null hypothesis \(H_0: \beta_1=0\) against the alternative hypothesis \(H_1: \beta_1\neq 0\) is
\[ \frac{1}{\sigma^2}\left(\sum_{i=1}^N(y_i-\beta_0)^2-\sum_{i=1}^N(y_i-x_i\beta_1)^2\right) \]
The LRT statistic is \(-2 \ln\big(\frac{L(H_0)}{L(H_1)}\big) = -2[l(H_0) - l(H_1)]\).
Substituting \(l(H_0)\) from 2.3 and \(l(H_1)\) from 2.1, the \(-\frac{N}{2}\ln(2\pi\sigma^2)\) terms cancel: \[ -2[l(H_0) - l(H_1)] = -2\left[-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-\beta_0)^2 + \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-x_i\beta_1)^2\right] = \frac{1}{\sigma^2}\left(\sum_{i=1}^N(y_i-\beta_0)^2 - \sum_{i=1}^N(y_i-x_i\beta_1)^2\right) \]
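As a hypothetical numerical illustration (toy data, \(\sigma^2\) treated as known), the statistic reduces to a scaled difference of residual sums of squares:

```python
# LRT for H0: beta1 = 0 (intercept-only) vs H1 (slope model), toy data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
sigma2 = 1.0  # treated as known for illustration

beta1_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
beta0_hat = sum(y) / len(y)  # under H0 the MLE is the sample mean

rss_alt = sum((yi - xi * beta1_hat) ** 2 for xi, yi in zip(x, y))
rss_null = sum((yi - beta0_hat) ** 2 for yi in y)

# Difference of residual sums of squares, scaled by the known variance.
lrt = (rss_null - rss_alt) / sigma2
print(lrt)  # ≈ 18.803 on this toy data
```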
From a machine learning perspective, we usually ask the model to minimize the mean squared error (MSE, \(\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2\), where \(\hat{y}_i\) is the predicted value, \(x_i\hat{\beta}_1\) in our case). Please describe why minimizing the MSE is equivalent to finding the MLE.
In the log-likelihood, the only term involving \(\beta_1\) is \(-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - x_i\beta_1)^2\). Since \(\sigma^2 > 0\), maximizing the log-likelihood over \(\beta_1\) is exactly minimizing \(\sum_{i=1}^N(y_i - x_i\beta_1)^2\), which is (up to the constant factor \(1/N\)) the MSE. Under Gaussian noise, the least-squares/MSE solution and the MLE therefore coincide.
In typical linear regression, \(R^2\) or adjusted-\(R^2\) is the more common choice for indicating how good a model is. Please describe the advantages of the likelihood ratio test over \(R^2\).
Advantages include:
- The LRT yields a formal hypothesis test: under \(H_0\) the statistic asymptotically follows a \(\chi^2\) distribution, giving a p-value, while \(R^2\) is purely descriptive.
- When comparing nested models, the LRT accounts for the number of added parameters through its degrees of freedom, whereas \(R^2\) never decreases as parameters are added (adjusted-\(R^2\) only partially corrects this).
- The LRT applies to any likelihood-based model (e.g., logistic or Poisson regression), where \(R^2\) has no natural definition.
A scientist is studying the effect of a new drug on the expression level of a specific gene. They have measured the expression levels in two small groups of mice: a Control group (\(n=3\)) and a Treatment group (\(m=3\)). The expression levels of these mice are Control: \(\{10, 12, 14\}\) and Treatment: \(\{18, 20, 22\}\).
We want to test whether the drug significantly increases mean gene expression using a Permutation Test.
What are the null and alternative hypotheses of the test?
Null Hypothesis (\(H_0\)): \(\mu_C = \mu_T\)
There is no difference in the distribution of gene expression between the Control and Treatment groups.
Alternative Hypothesis (\(H_1\)): \(\mu_T > \mu_C\)
This is a one-tailed test.
What is the test statistic \(\Delta_{obs}\)?
\[ \begin{aligned} \Delta =& \bar{X}_T - \bar{X}_C \\ \Delta_{obs} =& 20 - 12 = 8 \end{aligned} \]
Please list at least 3 kinds of permutations and their corresponding test statistics (\(\Delta_{perm}\)).
\[ T = \{18, 20, 22\}, C = \{10, 12, 14\} \implies \Delta = 8 \]
\[ T = \{10, 18, 22\}, C = \{12, 14, 20\} \implies \bar{X}_T \approx 16.67, \bar{X}_C \approx 15.33 \implies \Delta \approx 1.33 \]
\[ T = \{14, 18, 20\}, C = \{10, 12, 22\} \implies \bar{X}_T \approx 17.33, \bar{X}_C \approx 14.67 \implies \Delta \approx 2.67 \]
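Under exchangeability the full null distribution can be enumerated. A minimal sketch using the group values above (Control \(\{10, 12, 14\}\), Treatment \(\{18, 20, 22\}\)); there are \(\binom{6}{3} = 20\) distinct label assignments:

```python
from itertools import combinations

pooled = [10, 12, 14, 18, 20, 22]  # all six observations pooled
delta_obs = 8.0                    # observed difference in means

deltas = []
for treat in combinations(pooled, 3):  # choose which 3 are labeled "Treatment"
    ctrl = [v for v in pooled if v not in treat]
    deltas.append(sum(treat) / 3 - sum(ctrl) / 3)

print(len(deltas))                          # 20 relabelings in total
print(sum(d >= delta_obs for d in deltas))  # 1: only the observed labeling
```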
If after these 25 permutations, only 1 (the original data) results in a difference \(\ge \Delta_{obs}\), what is the p-value?
The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the observed statistic, assuming \(H_0\) is true. \[ p = \frac{\text{Number of permutations where } \Delta \ge \Delta_{obs}}{\text{Total number of permutations}} = \frac{1}{25} = 0.04 \]
What is the assumption of the permutation test?
Exchangeability: under \(H_0\), the group labels carry no information, so the joint distribution of the observations is invariant to relabeling. Every assignment of labels is equally likely, and the observed labeling is just one arrangement among all possible ones.
What are the advantages of the permutation test?
It is distribution-free: it makes no assumption about the underlying distribution (e.g., normality), which makes it well suited to small samples such as this one.
Suppose we have a set of independent and identically distributed (i.i.d.) observations \(X_1, X_2, \dots, X_n\) from a Normal distribution with known mean \(\mu\) and unknown variance \(\sigma^2\):
\[ X_i \sim N(\mu, \sigma^2) \]
Please show that MLE for the variance \(\sigma^2\) is: \[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \]
(Note: For this exercise, assume \(\mu\) is a known constant. If \(\mu\) were unknown, we would replace it with the sample mean \(\bar{X}\).)
The Likelihood is: \[ \begin{aligned} L(\sigma^2) =& \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right) \\ =& \left( 2\pi\sigma^2 \right)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \right) \end{aligned} \]
To simplify the differentiation, we calculate the log-likelihood: \[ \ell(\sigma^2) = \ln L(\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 \]
Take the derivative with respect to our parameter of interest, \(\sigma^2\), and set it to zero: \[ \frac{d}{d(\sigma^2)} \ell(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (X_i - \mu)^2 = 0 \]
Multiply by \(2(\sigma^2)^2\): \[ \begin{aligned} -n\sigma^2 + \sum_{i=1}^n (X_i - \mu)^2 = 0 \\ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \end{aligned} \]
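A numerical check on hypothetical data: the closed form should coincide with the grid maximizer of the log-likelihood (the grid search is an illustrative stand-in for an optimizer).

```python
import math

# Hypothetical observations with known mean mu.
mu = 5.0
xs = [4.2, 5.1, 6.3, 4.8, 5.6]
n = len(xs)

# Closed-form MLE derived above: mean squared deviation around mu.
sigma2_hat = sum((x - mu) ** 2 for x in xs) / n

def loglik(var):
    # Log-likelihood as a function of the variance, mu known.
    return -n / 2 * math.log(2 * math.pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

# Grid search over candidate variances; the argmax should sit at sigma2_hat.
grid = [i / 1000 for i in range(1, 5000)]
best = max(grid, key=loglik)
print(sigma2_hat, best)
```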
If you recall from the previous course, a more common approach for a contingency table is Pearson’s \(\chi^2\) test, but in this question we will work in a likelihood-ratio manner. We are investigating the effect of a new treatment on curing cancer. We obtain the following contingency table:
| | Cured | Not cured | Marginal |
|---|---|---|---|
| Treatment | \(O_{11}\) | \(O_{12}\) | \(O_{11}+O_{12}\) |
| Not Treatment | \(O_{21}\) | \(O_{22}\) | \(O_{21}+O_{22}\) |
| Marginal | \(O_{11}+O_{21}\) | \(O_{12}+O_{22}\) | \(N\) |
A contingency table with \(n\) total observations follows a Multinomial distribution. The likelihood function for observing counts \(O_{ij}\) with cell probabilities \(p_{ij}\) is: \[ L = \frac{n!}{\prod O_{ij}!} \prod p_{ij}^{O_{ij}} \]
If the treatment and outcome are not independent, the MLE of \(p_{ij}\) is \(\frac{O_{ij}}{N}\).
If the treatment and outcome are independent, we restrict \(p_{ij} = \frac{\text{Row total}}{N}\cdot\frac{\text{Column total}}{N}\), so the expected count is \(E_{ij} = p_{ij}\times N = \frac{\text{Row total}\times\text{Column total}}{N}\).
Please show that the LRT test statistic is
\[ -2 \ln \lambda =-2 \sum O_{ij} \ln\left( \frac{E_{ij}}{O_{ij}} \right) \]
\[ \begin{aligned} -2\ln(\lambda) =& -2\ln\Bigg( \frac{\prod \left( \frac{E_{ij}}{n} \right)^{O_{ij}}}{\prod \left( \frac{O_{ij}}{n} \right)^{O_{ij}}}\Bigg) \\ =& -2 \Big(\sum O_{ij} \ln\left( \frac{E_{ij}}{n} \right) - \sum O_{ij} \ln\left( \frac{O_{ij}}{n} \right)\Big) \\ =& -2\sum O_{ij} \ln\left( \frac{E_{ij}/n}{O_{ij}/n} \right) \\ =& -2\sum O_{ij} \ln\left( \frac{E_{ij}}{O_{ij}} \right) \end{aligned} \]
In the \(2 \times 2\) contingency table, please show that Pearson’s \(\chi^2\) test statistic (\(X^2 = \sum \frac{(O_i - E_i)^2}{E_i}\)) is an approximation to the LRT via a second-order Taylor expansion.
Let \(\delta_i = O_i - E_i\) be the deviation of the observed count from the expected count.
Note that \(\sum \delta_i = 0\) (i.e. \(\sum O_i = \sum E_i\)) because the sum of observed counts must equal the sum of expected counts.
Rewrite the LRT statistic using \(\frac{O_i}{E_i} = \frac{E_i + \delta_i}{E_i} = 1 + \frac{\delta_i}{E_i}\); then
\[ G = 2 \sum O_i \ln\left(1 + \frac{\delta_i}{E_i}\right) \]
The Taylor series expansion for \(\ln(1+x)\) around \(x=0\) is:
\[ \ln(1+x) \approx x - \frac{x^2}{2} + \frac{x^3}{3} - \dots \]
Applying this to our term \(\ln\left(1 + \frac{\delta_i}{E_i}\right)\), where \(x = \frac{\delta_i}{E_i}\)
\[ \ln\left(1 + \frac{\delta_i}{E_i}\right) \approx \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2E_i^2} \]
Substitute this approximation back into the \(G\) formula:
\[ \begin{aligned} G \approx& 2 \sum O_i \left( \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2E_i^2} \right) \\ = & 2 \sum (E_i + \delta_i) \left( \frac{\delta_i}{E_i} - \frac{\delta_i^2}{2E_i^2} \right) \\ = & 2 \sum \left( \delta_i - \frac{\delta_i^2}{2E_i} + \frac{\delta_i^2}{E_i} - \frac{\delta_i^3}{2E_i^2} \right) \end{aligned} \]
Ignore the higher-order term (\(\delta_i^3\)), which is negligible when the deviations \(\delta_i\) are small relative to \(E_i\).
Simplify the remaining terms
\[ \begin{aligned} G \approx& 2 \left( \sum \delta_i + \sum \frac{\delta_i^2}{2E_i} \right) \\ =& 2 \left( 0 + \frac{1}{2} \sum \frac{\delta_i^2}{E_i} \right) \\ =& \sum \frac{\delta_i^2}{E_i}\\ =& \sum \frac{(O_i - E_i)^2}{E_i} \end{aligned} \]
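The closeness of \(G\) and \(X^2\) can be seen on a hypothetical \(2\times 2\) table (counts invented purely for illustration):

```python
import math

# Hypothetical counts: rows = Treatment / No treatment, cols = Cured / Not cured.
O = [[30, 10], [20, 20]]
n = sum(sum(row) for row in O)

# Expected counts under independence: row total * column total / n.
row_tot = [sum(row) for row in O]
col_tot = [O[0][j] + O[1][j] for j in range(2)]
E = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]

# G = 2 * sum O ln(O/E);  X2 = sum (O - E)^2 / E.
cells = [(O[i][j], E[i][j]) for i in range(2) for j in range(2)]
G = 2 * sum(o * math.log(o / e) for o, e in cells)
X2 = sum((o - e) ** 2 / e for o, e in cells)
print(round(G, 3), round(X2, 3))  # the two statistics nearly agree
```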
We have discussed Pearson’s \(\chi^2\) test, the LRT, and Fisher’s exact test. Please comment on when to use which test.
Some guidelines:
- Fisher’s exact test: best for small samples (e.g., any expected cell count below 5), since it gives an exact p-value without relying on asymptotic approximation.
- Pearson’s \(\chi^2\) test: the standard large-sample choice when all expected counts are adequate; it is simple to compute and widely reported.
- LRT (\(G\)-test): asymptotically equivalent to Pearson’s \(\chi^2\); preferable when working within a likelihood framework, since likelihood-ratio statistics compose naturally across nested models.

On the other hand, some other considerations are the sample size, whether the table margins are fixed by design, and whether the analysis must extend to larger or more complex models.